week 2 exercise - part 1
Basic visualization with Matplotlib
Matplotlib
First, we import the required libraries using standard conventions: numpy for all our mathematical needs, matplotlib as the plotting library, and pyplot, which provides a convenient API for creating plots with matplotlib. Later we will introduce Seaborn as well.
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
# we need the following line to indicate that the plots should be shown inline with the Jupyter notebook.
%matplotlib inline
We will first create a simple plot of a mathematical function. We first create a numpy array of x-values, then compute the corresponding y-value (the function value) for each x-value. Plotting the function is then as easy as passing it the x and y values.
X = np.linspace(-np.pi, np.pi, 100) # define a NumPy array with 100 points in the range -Pi to Pi
Y = np.sin(X) # define the curve Y by the sine of X
plt.plot(X,Y); # use matplotlib to plot the function
While creating such plots is perfectly fine when you are exploring data, in your final notebook a bare plot is hard for the reader to understand. With matplotlib it is very easy to add labels, a title and a legend. You can also change the limits of the plot, the style of the lines, and much more.
The following could be seen as the bare minimum for a plot to be understood as part of reproducible research.
plt.plot(X, Y, 'r--', linewidth=2)
plt.plot(X, Y/2, 'b-', linewidth=2)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Plot Title')
plt.xlim(-4, 4)
plt.ylim(-1.2, 1.2)
plt.legend(['red curve', 'blue curve'], loc='best')
Go to the documentation pages of Matplotlib http://matplotlib.org/contents.html to find all the possible options for a plot and also to see more tutorials, videos and book chapters to help you along the way.
This assignment first shows you how to download CSV data from an online source. Then we explore a dataset of all the cities in the world and compare cities in the Netherlands to the rest of the world.
Loading data: CSV and Pandas
We will work with a database of information about cities around the world:
https://dev.maxmind.com/geoip/geoip2/geolite2/
Working with data structures can be done in many ways in Python. There are the standard Python arrays, lists and tuples. You can also use the arrays in the numpy package which allow you to do heavy math operations efficiently. For data analysis Pandas is often used, because data can be put into so-called dataframes. Dataframes store data with column and row names and can easily be manipulated and plotted. You will learn more about Pandas in the Machine Learning workshops. A short intro can be found here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
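As a minimal illustration of what a dataframe gives you, here is a small sketch with made-up numbers (the city names and populations below are purely illustrative, not taken from the dataset):

```python
import pandas as pd

# A tiny dataframe: labelled columns make selection and statistics easy.
df = pd.DataFrame({'City': ['amsterdam', 'eindhoven'],
                   'Population': [800000, 200000]})

print(df['Population'].mean())        # column statistics: 500000.0
print(df[df['Population'] > 500000])  # boolean filtering keeps only the amsterdam row
```

This label-based selection style is exactly what we will use on the cities dataset below.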
import urllib.request as urllib, os
url = 'https://github.com/CODAIT/redrock/raw/master/twitter-decahose/src/main/resources/Location/'
filename = 'worldcitiespop.txt.gz'
datafolder = 'data/'
downloaded = urllib.urlopen(url + filename)
buf = downloaded.read()
try:
    os.mkdir(datafolder)
except FileExistsError:
    pass
with open(datafolder + filename, 'wb') as f:
    f.write(buf)
import pandas as pd
# reading files may cause problems or give errors... Can you explain the use of the encoding parameter?
cities = pd.read_csv(datafolder + filename, sep=',', low_memory=False, encoding = 'ISO-8859-1')
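A quick illustration of why the encoding parameter matters, assuming the file stores accented city names as single Latin-1 bytes (which are not valid UTF-8):

```python
# Encode an accented name as Latin-1 / ISO-8859-1: 'ã' becomes the single byte 0xE3.
raw = 'São Paulo'.encode('ISO-8859-1')

# Decoding with the matching encoding recovers the original text...
print(raw.decode('ISO-8859-1'))  # São Paulo

# ...while decoding the same bytes as UTF-8 fails, because 0xE3 starts a
# multi-byte UTF-8 sequence that the following bytes do not complete.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 decoding fails:', exc)
```

Passing the wrong encoding to `read_csv` would therefore either raise this error or silently garble the accented names.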
Data Manipulation
We can take a peek at the data by checking out the final rows of data. Do you see any potential problem with this dataset?
cities = cities.dropna(subset=['Population'])  # keep only rows with a known population
cities.tail()
| | Country | City | AccentCity | Region | Population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 3173646 | zw | redcliffe | Redcliffe | 06 | 38231.0 | -19.033333 | 29.783333 |
| 3173676 | zw | rusape | Rusape | 04 | 23761.0 | -18.533333 | 32.116667 |
| 3173737 | zw | shurugwi | Shurugwi | 07 | 17107.0 | -19.666667 | 30.000000 |
| 3173892 | zw | victoria falls | Victoria Falls | 00 | 36702.0 | -17.933333 | 25.833333 |
| 3173957 | zw | zvishavane | Zvishavane | 07 | 79876.0 | -20.333333 | 30.033333 |
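One likely answer to the question above is that most rows have no Population value at all, which is why the `dropna` call shrinks the table so drastically. A minimal sketch with a toy frame (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the cities table: most Population values are missing.
toy = pd.DataFrame({'City': ['a', 'b', 'c', 'd'],
                    'Population': [1000.0, np.nan, np.nan, 2500.0]})

print(toy['Population'].isna().sum())           # 2 rows lack a population
print(len(toy.dropna(subset=['Population'])))   # 2 rows remain after dropna
```

Running the same `isna().sum()` check on the real frame before dropping rows would quantify how much data is lost.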
cities.sort_values(by='Population', ascending=False).head(20)
| | Country | City | AccentCity | Region | Population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 1544449 | jp | tokyo | Tokyo | 40 | 31480498.0 | 35.685000 | 139.751389 |
| 570824 | cn | shanghai | Shanghai | 23 | 14608512.0 | 31.045556 | 121.399722 |
| 1327914 | in | bombay | Bombay | 16 | 12692717.0 | 18.975000 | 72.825833 |
| 2200161 | pk | karachi | Karachi | 05 | 11627378.0 | 24.905600 | 67.082200 |
| 1349146 | in | new delhi | New Delhi | 07 | 10928270.0 | 28.600000 | 77.200000 |
| 1331162 | in | delhi | Delhi | 07 | 10928270.0 | 28.666667 | 77.216667 |
| 2130459 | ph | manila | Manila | D9 | 10443877.0 | 14.604200 | 120.982200 |
| 2461968 | ru | moscow | Moscow | 48 | 10381288.0 | 55.752222 | 37.615556 |
| 1626528 | kr | seoul | Seoul | 11 | 10323448.0 | 37.598500 | 126.978300 |
| 316800 | br | sao paulo | SĆ£o Paulo | 27 | 10021437.0 | -23.473293 | -46.665803 |
| 2800596 | tr | istanbul | Istanbul | 34 | 9797536.0 | 41.018611 | 28.964722 |
| 2003442 | ng | lagos | Lagos | 05 | 8789133.0 | 6.453056 | 3.395833 |
| 1892345 | mx | mexico | Mexico | 09 | 8720916.0 | 19.434167 | -99.138611 |
| 1186762 | id | jakarta | Jakarta | 04 | 8540306.0 | -6.174444 | 106.829444 |
| 2990572 | us | new york | New York | NY | 8107916.0 | 40.714167 | -74.006389 |
| 362418 | cd | kinshasa | Kinshasa | 06 | 7787832.0 | -4.300000 | 15.300000 |
| 842667 | eg | cairo | Cairo | 11 | 7734602.0 | 30.050000 | 31.250000 |
| 2074194 | pe | lima | Lima | 15 | 7646786.0 | -12.050000 | -77.050000 |
| 553246 | cn | peking | Peking | 22 | 7480601.0 | 39.928889 | 116.388333 |
| 996635 | gb | london | London | H9 | 7421228.0 | 51.514125 | -0.093689 |
Sorting the cities by population immediately brings up the entries for some of the largest cities in the world.
Assignment 1a
To get an idea of where in the world the cities in the dataset are located, we want to make a scatter plot of the position of all the cities in the dataset.
Don't worry about drawing country borders, just plot the locations of the cities.
Remember to use all the basic plot elements you need to understand this plot.
import numpy as np
from matplotlib import pyplot as plt
plt.scatter(cities['Longitude'], cities['Latitude'], s=1)  # small markers, since there are many points
plt.title('Positions of Cities in the Dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Assignment 1b
Create a visualization to show the top-20 cities with the highest population.
Remember to use all the basic plot elements you need to understand this plot.
top_cities = cities.sort_values(by='Population', ascending=False).head(20)
plt.scatter(cities['Longitude'], cities['Latitude'], s=1, color='gray')
plt.scatter(top_cities['Longitude'], top_cities['Latitude'], color='red')
plt.title('Top-20 Most Populous Cities Among All Cities in the Dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(['all cities', 'top-20 by population'], loc='lower left')
plt.show()
Assignment 1c
Now we want to plot the cities in The Netherlands only. Use a scatter plot again to plot the cities, but now vary the size of the marker and the color with the population of that city.
Use a colorbar to show how the color of the marker relates to its population.
Use sensible limits to your axes so that you show only mainland The Netherlands (and not the Dutch Antilles).
dutch_cities = cities[cities['Country'] == 'nl']
max_population = dutch_cities['Population'].max()
# scale the marker area with population; the largest city gets marker size 1000
size_marker = 1000 * dutch_cities['Population'] / max_population
plt.figure(figsize=[7, 7])
plt.xlim(3.2, 7.5)
plt.ylim(50.6, 53.6)
plt.scatter(dutch_cities['Longitude'], dutch_cities['Latitude'],
            s=size_marker, c=dutch_cities['Population'], cmap='plasma')
plt.colorbar(label='Population')
plt.title('Positions of Dutch Cities in the Dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Assignment 1d
Looking at the previous assignment, we can see larger cities such as Amsterdam, Rotterdam and even Eindhoven. But we still do not really have a clear overview of how many big cities there are. Create a visualisation to show the distribution of the population for all Dutch cities.
Add proper basic plot elements to this plot and add an annotation to indicate Amsterdam and Eindhoven in this distribution.
## Your code and explanation in comments...
dutch_cities = cities[cities['Country'] == 'nl']
plt.hist(dutch_cities['Population'], color='blue', edgecolor='black', bins=50)
plt.xlabel('Population')
plt.ylabel('Count')
plt.title('Population Distribution of Dutch Cities')
plt.grid(True)
# Add annotations for Amsterdam and Eindhoven; .iloc[0] turns the one-row
# selection into a scalar that annotate can use as a coordinate
amsterdam_population = dutch_cities[dutch_cities['City'] == 'amsterdam']['Population'].iloc[0]
eindhoven_population = dutch_cities[dutch_cities['City'] == 'eindhoven']['Population'].iloc[0]
plt.annotate('Amsterdam', xy=(amsterdam_population, 3), xytext=(amsterdam_population - 100000, 15),
             arrowprops=dict(facecolor='black', arrowstyle='->'))
plt.annotate('Eindhoven', xy=(eindhoven_population, 3), xytext=(eindhoven_population + 1000, 15),
             arrowprops=dict(facecolor='black', arrowstyle='->'))
plt.show()
Assignment 1e
Now we want to compare how the distribution of Dutch cities compares to that of the entire world.
Use subplots to show the Dutch distribution (top plot) and the world distribution (bottom plot).
plt.figure(figsize=[20, 8])
plt.subplot(2, 1, 1)
plt.hist(np.asarray(dutch_cities.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.1)
plt.title('Distribution of Dutch Cities')
plt.ylabel('Density')
# Add the subplot of the world cities below this Dutch one
plt.subplot(2, 1, 2)
plt.hist(np.asarray(cities.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.1)
plt.title('Distribution of World Cities')
plt.xlabel('Population (thousands)')
plt.ylabel('Density')
plt.tight_layout()
Assignment 1f
What conclusions can you deduce from the above plots?
# Dutch cities seem more evenly spread in population, whereas in the worldwide distribution far more of the mass sits in the smallest cities.
Assignment 2
Create a data visualization to compare the top-3 largest cities for Japan, Germany and your own (home) country. Add a clear conclusion about the comparison.
cities_in_Japan = cities[cities['Country'] == 'jp']
cities_in_Germany = cities[cities['Country'] == 'de']
cities_in_Bulgaria = cities[cities['Country'] == 'bg']
plt.figure(figsize=[20, 12])
plt.subplot(3, 1, 1)
plt.hist(np.asarray(cities_in_Japan.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.15)
plt.title('Japan')
plt.ylabel('Density')
plt.subplot(3, 1, 2)
plt.hist(np.asarray(cities_in_Germany.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.15)
plt.title('Germany')
plt.ylabel('Density')
plt.subplot(3, 1, 3)
plt.hist(np.asarray(cities_in_Bulgaria.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.15)
plt.title('Bulgaria')
plt.xlabel('Population (thousands)', fontsize=14)
plt.ylabel('Density')
plt.suptitle('City Population Distributions of Japan, Germany and Bulgaria', fontsize=16)
# Comparing the population distributions of cities in Japan, Germany and Bulgaria shows a clear contrast.
# In Japan most of the population lives in the bigger cities; Germany's population is concentrated more in
# middle-sized cities; and in Bulgaria most of the population lives in small cities. Going from Japan
# through Germany to Bulgaria there is a downward trend in living in big cities.
week 2 exercise - part 2
Data visualization (part 2): Two additional Chart Types for Exploring
This assignment first shows two useful chart types: parallel coordinates and scatter matrix. You will practice these plots using a new dataset.
Parallel Coordinates with Pandas
First, we import the required libraries, using standard conventions. For the example of parallel coordinates we shall use the famous iris data set, describing the sepal and petal dimensions for three types of irises.
import pandas as pd
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', sep=',', low_memory=False, encoding = 'ISO-8859-1', header=None)
iris.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'name']  # column order as documented for iris.data
iris.head()
| | sepal length | sepal width | petal length | petal width | name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Now we do not use matplotlib directly but use a plot function of the pandas library that uses matplotlib in the background. In this case we create a parallel coordinates plot.
Pandas has many plotting functions, as can be seen here: http://pandas.pydata.org/pandas-docs/stable/visualization.html#parallel-coordinates
The parallel coordinates plot can give insight into a dataset with a large number of features. For the iris set there are four features (petal width, petal length, sepal width, sepal length).
While you can make a scatter plot with 4 features using x, y, color and size, a parallel coordinates plot is usually easier to understand once you know how to read it. Here is the scatter plot:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
fig = plt.figure()
plt.scatter(iris['petal width'], iris['petal length'], c=iris['sepal width'], s=iris['sepal length']**4)
plt.xlabel('petal width [cm]')
plt.ylabel('petal length [cm]')
plt.colorbar(label='sepal width [cm]');
import numpy as np
from matplotlib import pyplot as plt
from pandas.plotting import parallel_coordinates
%matplotlib inline
fig = plt.figure(figsize=[15,6])
ax = parallel_coordinates(iris,'name')
ax.set_ylabel('width/length [cm]');
Scatter Matrix with Pandas
A scatter matrix is a chart that gives you an overview of the correlations between any number of features.
from pandas.plotting import scatter_matrix
scatter_matrix(iris, alpha=1, figsize=(12, 12), diagonal='kde');
# or see what happens if we use the Seaborn library...
sns.pairplot(iris)
# Seaborn provides some simple ways to explore the data and correlations in more (visual) detail...
sns.pairplot(iris, hue="name")
Assignment 3
Now try to create similar plots for a new dataset about car features.
# The data file is quite nasty, with several different delimiters that read_csv cannot handle very well
names=['mpg','cylinders','displacement','horsepower','weight','acceleration','model year','origin','car name','j','k','l','m','n']
cars = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data', delimiter=r"\s+", names=names, header=None, engine='python')
# Create a subset of the dataset with the useful numeric features; column 3 (horsepower)
# is dropped because it contains '?' placeholders and would be read as strings
cars = cars.iloc[:, [0, 1, 2, 4, 5, 6, 7]]
scatter_matrix(cars, alpha=1, figsize=(12, 12), diagonal='kde');
sns.pairplot(cars)
sns.pairplot(cars, hue='origin')
Create a normalized dataset
using z-score standardization, a close relative of mean normalization (see: https://en.wikipedia.org/wiki/Feature_scaling#Mean_normalization); mean normalization divides by the range, while the code below divides by the standard deviation
cars_norm = (cars - cars.mean()) / cars.std()
cars_norm['origin'] = cars['origin']  # keep origin unnormalized so it can serve as the class label when plotting
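A quick sanity check (a sketch with made-up numbers): after subtracting the mean and dividing by the standard deviation, every numeric column ends up with mean approximately 0 and standard deviation 1.

```python
import pandas as pd

# Tiny numeric frame (illustrative values only, not from the car dataset)
df = pd.DataFrame({'weight': [1500.0, 2000.0, 2500.0],
                   'mpg': [30.0, 25.0, 20.0]})

df_norm = (df - df.mean()) / df.std()

print(df_norm.mean().abs().max())  # ~0 (up to floating-point noise)
print(df_norm.std())               # 1.0 for every column
```

This is why the normalized columns all share one vertical scale in the parallel coordinates plot below.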
Next, create a parallel coordinates plot. What happens when you do not use the normalized data?
## Create the parallel coordinates plot here
import numpy as np
from matplotlib import pyplot as plt
from pandas.plotting import parallel_coordinates
%matplotlib inline
fig = plt.figure(figsize=[15,6])
ax = parallel_coordinates(cars_norm,'origin', color=('blue', 'green', 'red'))
ax.set_ylabel('Values');
Answer this question: What conclusions can you make from the relation between weight and acceleration? If you don't understand how to interpret parallel coordinates plots, read: https://eagereyes.org/techniques/parallel-coordinates.
# The relation between weight and acceleration is inverse: the heavier the car,
# the lower its acceleration, and vice versa.
Next, try to highlight the model years >= 80.
Hints:
- you can slice your data with `cars_norm[cars['model year'] >= 80]`.
- you can plot both all data and the sliced data on top of each other with different colors.
## Create the parallel coordinates plot here
import numpy as np
from matplotlib import pyplot as plt
from pandas.plotting import parallel_coordinates
%matplotlib inline
# slice the already-normalized data, so both layers share the same scale
cars_norm_after_80 = cars_norm[cars['model year'] >= 80]
fig = plt.figure(figsize=[15, 6])
# draw all cars in gray as a background layer, then the recent models on top
parallel_coordinates(cars_norm, 'origin', color=('lightgray', 'lightgray', 'lightgray'))
ax = parallel_coordinates(cars_norm_after_80, 'origin', color=('blue', 'green', 'red'))
plt.title('Parallel Coordinates Plot (model years >= 80 highlighted)')
plt.xlabel('Features')
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend(loc='upper right')
Answer this question: what conclusions can you draw from cars with model years 80-82?
## Low mpg for cars with many cylinders, and slightly better mpg for fewer cylinders, though the
# difference is not that significant. The weight-to-acceleration ratio also looks a bit more
# balanced than in the full set of models.
Now, create a scatter matrix for the car data. Do we need to use the normalized data? Are we looking for a dataset that we can easily cluster, or are we more likely to find trends?
## Create the scatter matrix here
scatter_matrix(cars_norm, alpha=1, figsize=(12, 12), diagonal='kde');
What are your final conclusions looking at the (visual) results? What did you learn about the data and dataset? Or what new questions did you derive from the plots you've made?
## Final conclusion: I would go looking for trends rather than clustering. I learned that there are
# relationships between mpg and cylinders, and between mpg and weight.